This analysis was inspired by part 5 of Julia Silge’s Text mining with tidy data principles course. The data set was curated by Kaylin Pavlik and contains songs that appeared on Billboard’s Year-End Hot 100 throughout five decades.
The variables in this dataset are the following:
| Variable | Detail |
|---|---|
| rank | the rank a song achieved on the Billboard Year-End Hot 100 |
| song | the song’s title |
| artist | the artist who recorded the song |
| year | the year the song reached the given rank on the Billboard chart |
| lyrics | the lyrics of the song |
The dataset consists of more than 5,000 songs, spanning from 1965 to 2015. The lyrics are stored in a single column, so we need to convert them into a tidy format. We can do so using the unnest_tokens() function, tokenising the lyrics and creating a new word column:
library(tidyverse)
library(tidytext)
library(knitr)

tidy_lyrics <- lyrics %>%
# tokenise the lyrics into one word per row
unnest_tokens(word, Lyrics)
head(tidy_lyrics) %>% kable()
| Rank | Song | Artist | Year | Source | word |
|---|---|---|---|---|---|
| 1 | wooly bully | sam the sham and the pharaohs | 1965 | 3 | sam |
| 1 | wooly bully | sam the sham and the pharaohs | 1965 | 3 | the |
| 1 | wooly bully | sam the sham and the pharaohs | 1965 | 3 | sham |
| 1 | wooly bully | sam the sham and the pharaohs | 1965 | 3 | miscellaneous |
| 1 | wooly bully | sam the sham and the pharaohs | 1965 | 3 | wooly |
| 1 | wooly bully | sam the sham and the pharaohs | 1965 | 3 | bully |
We might be interested in what the most common words in the song lyrics are. Unsurprisingly, stop words like ‘you’, ‘i’ and ‘the’ are among the most common. At the other end, words like ‘zoomed’, ‘zucchinis’ and ‘zwei’ appear only once:
tidy_lyrics %>%
count(word, sort=TRUE) %>% head()
word n
1 you 64606
2 i 56472
3 the 53451
4 to 35752
5 and 32555
6 me 31170
tidy_lyrics %>%
count(word, sort=TRUE) %>% tail()
word n
42152 zoomed 1
42153 zooms 1
42154 zooped 1
42155 zucchinis 1
42156 zulu 1
42157 zwei 1
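The row numbers in the output above suggest the vocabulary size: count(word) produces one row per distinct word, and the last row is indexed 42157. As a quick sanity check, a minimal sketch assuming the `tidy_lyrics` tibble from the tokenising step above:

```r
# number of distinct words (vocabulary size) across all lyrics;
# should match the 42,157 rows in the count output above
tidy_lyrics %>%
  summarise(vocabulary = n_distinct(word))
```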
The number of words per song appears to be positively correlated with release year: as the years increase, so does the number of words in songs.
tidy_lyrics %>%
count(Year,Song) %>%
filter(n>1) %>% # keep songs with more than one word counted
ggplot(aes(Year,n))+
geom_point(alpha=0.4, size=5,color="orange")+
geom_smooth(method="lm", color="black")+
theme_minimal()
Let’s try to figure out which songs have very few or very many words:
tidy_lyrics %>%
count(Year, Song) %>%
arrange(-n) %>% # change to arrange(n) to display the songs with the fewest words
head()
Year Song n
1 2007 im a flirt 1156
2 1998 been around the world 1149
3 2009 forever 1050
4 2010 forever 1050
5 2003 air force ones 1042
6 1988 dont be cruel 1038
We can extract a song of interest using filter(); in this case we filter for “wipe out” by The Surfaris:
lyrics %>%
filter(Song == "wipe out")
Rank Song Artist Year Lyrics Source
1 63 wipe out the surfaris 1966 wipe out ha ha ha 1
In order to explore the evolution of pop song vocabulary over the decades, we can build some linear models. The first step is to create a data set of word counts. This involves counting how often each word is used each year, grouping the data by year, and creating a new column containing the total words used each year. Finally, we filter the data set to only include words with more than 500 total uses, since we don’t want to train models on words that are used sparingly:
word_counts <- tidy_lyrics %>%
anti_join(get_stopwords()) %>%
count(Year, word) %>%
# group by `year`
group_by(Year) %>%
# create a new column for the total words per year
mutate(year_total = sum(n)) %>%
ungroup() %>%
# now group by `word`
group_by(word) %>%
# keep only words used more than 500 times
filter(sum(n) > 500) %>%
ungroup()
word_counts
# A tibble: 14,791 × 4
Year word n year_total
<int> <chr> <int> <int>
1 1965 ah 30 9845
2 1965 aint 47 9845
3 1965 alone 5 9845
4 1965 alright 7 9845
5 1965 always 14 9845
6 1965 another 12 9845
7 1965 anything 2 9845
8 1965 arms 18 9845
9 1965 around 21 9845
10 1965 away 30 9845
# … with 14,781 more rows
Now that we have our data set, we can use it to train many models, one per word. The broom package enables us to handle the model output. Creating the models involves building list columns by nesting the word count data by word. We then use mutate() to create a new column for the models, thereby training a model for each word, where the number of successes (word counts) and the yearly totals are predicted by year:
library(broom)
slopes <- word_counts %>%
nest_by(word) %>%
# create a new column for our `model` objects
mutate(model = list(glm(cbind(n, year_total) ~ Year,
family = "binomial", data = data))) %>%
summarize(tidy(model)) %>%
ungroup() %>%
# filter to only keep the "year" terms
filter(term == "Year") %>%
mutate(p.value = p.adjust(p.value)) %>%
arrange(estimate)
slopes
# A tibble: 297 × 6
word term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 woman Year -0.0446 0.00254 -17.5 1.93e-66
2 lovin Year -0.0440 0.00289 -15.2 8.70e-50
3 morning Year -0.0409 0.00308 -13.3 7.07e-38
4 sweet Year -0.0397 0.00207 -19.1 3.56e-79
5 easy Year -0.0377 0.00310 -12.2 1.48e-31
6 loves Year -0.0337 0.00304 -11.1 3.13e-26
7 lonely Year -0.0309 0.00252 -12.2 4.63e-32
8 help Year -0.0297 0.00253 -11.8 1.59e-29
9 old Year -0.0280 0.00255 -11.0 1.41e-25
10 people Year -0.0275 0.00221 -12.4 3.97e-33
# … with 287 more rows
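Since the binomial models work on the log-odds scale, exponentiating an estimate gives the multiplicative change in a word’s odds of use per year. A minimal sketch, assuming the `slopes` tibble from above (the `yearly_change` column name is introduced here for illustration):

```r
# the estimates are log-odds slopes per year; exponentiating gives the
# multiplicative change in the odds of a word being used, per year
slopes %>%
  mutate(yearly_change = exp(estimate)) %>%
  select(word, estimate, yearly_change)
# e.g. exp(-0.0446) ≈ 0.956, roughly a 4.4% decline per year for "woman"
```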
We can use a volcano plot to visualise all the models we trained. This type of plot allows us to compare the effect size and statistical significance of each word’s trend.
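A minimal sketch of such a plot, assuming the `slopes` tibble created above, with the effect size (slope estimate) on the x-axis and the -log10 adjusted p-value on the y-axis; since plotly is attached in the session info below, the original plot was likely made interactive, e.g. by wrapping the ggplot in `ggplotly()`:

```r
library(ggplot2)

slopes %>%
  ggplot(aes(estimate, -log10(p.value))) +
  # each point is one word's model: left of zero = declining use over time
  geom_point(alpha = 0.6, size = 2, color = "orange") +
  geom_vline(xintercept = 0, lty = 2) +
  labs(x = "estimated change in word use per year (log-odds)",
       y = "-log10(adjusted p-value)") +
  theme_minimal()
```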
R version 4.1.0 (2021-05-18)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] plotly_4.9.4.1 broom_0.7.9 knitr_1.33 tidytext_0.3.1
[5] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.7 purrr_0.3.4
[9] readr_2.0.1 tidyr_1.1.3 tibble_3.1.4 ggplot2_3.3.5
[13] tidyverse_1.3.1
loaded via a namespace (and not attached):
[1] httr_1.4.2 sass_0.4.0 viridisLite_0.4.0
[4] jsonlite_1.7.2 splines_4.1.0 modelr_0.1.8
[7] bslib_0.3.0 assertthat_0.2.1 highr_0.9
[10] cellranger_1.1.0 yaml_2.2.1 pillar_1.6.2
[13] backports_1.2.1 lattice_0.20-44 glue_1.4.2
[16] digest_0.6.27 rvest_1.0.1 colorspace_2.0-2
[19] htmltools_0.5.2 Matrix_1.3-4 pkgconfig_2.0.3
[22] haven_2.4.3 scales_1.1.1 distill_1.2
[25] tzdb_0.1.2 downlit_0.2.1 mgcv_1.8-36
[28] generics_0.1.0 farver_2.1.0 ellipsis_0.3.2
[31] pacman_0.5.1 withr_2.4.2 lazyeval_0.2.2
[34] cli_3.0.1 magrittr_2.0.1 crayon_1.4.1
[37] readxl_1.3.1 evaluate_0.14 stopwords_2.2
[40] tokenizers_0.2.1 janeaustenr_0.1.5 fs_1.5.0
[43] fansi_0.5.0 nlme_3.1-152 SnowballC_0.7.0
[46] xml2_1.3.2 data.table_1.14.0 tools_4.1.0
[49] hms_1.1.0 lifecycle_1.0.0 munsell_0.5.0
[52] reprex_2.0.1 compiler_4.1.0 jquerylib_0.1.4
[55] rlang_0.4.11 grid_4.1.0 rstudioapi_0.13
[58] htmlwidgets_1.5.3 crosstalk_1.1.1 labeling_0.4.2
[61] rmarkdown_2.10 gtable_0.3.0 DBI_1.1.1
[64] R6_2.5.1 lubridate_1.7.10 fastmap_1.1.0
[67] utf8_1.2.2 stringi_1.7.4 Rcpp_1.0.7
[70] vctrs_0.3.8 dbplyr_2.1.1 tidyselect_1.1.1
[73] xfun_0.25